Theano, Lasagne

and why they matter

got no lasagne?

Install the bleeding edge version from here: http://lasagne.readthedocs.org/en/latest/user/installation.html

Warming up

  • Implement a function that computes the sum of squares of numbers from 0 to N
  • Use numpy or plain Python
  • Hint: an array of numbers from 0 to N-1 is numpy.arange(N); a reference sketch follows this list
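
For reference, a minimal numpy sketch of the same idea (the Theano cell further down builds the very same recipe symbolically); the helper name here is ours:

import numpy as np

def sum_squares_reference(N):
    # squares of 0, 1, ..., N-1, summed; float64 avoids integer overflow for large N
    return (np.arange(N, dtype=np.float64) ** 2).sum()

print(sum_squares_reference(10))  # 0 + 1 + 4 + ... + 81 = 285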

In [ ]:
import numpy as np


def sum_squares(N):
    return <YOUR CODE: student.implement_me()>

In [ ]:
%%time
sum_squares(10**8)

theano teaser

Doing the very same thing


In [ ]:
import theano
import theano.tensor as T

In [ ]:
# I will be a function parameter
N = T.scalar("a dimension", dtype='int32')


# I am a recipe for computing the sum of squares of arange(N), given N
result = (T.arange(N)**2).sum()

# Compiling the recipe of computing "result" given N
sum_function = theano.function(inputs=[N], outputs=result)

In [ ]:
%%time
sum_function(10**8)

How does it work?

  • 1 You define the inputs of your future function;
  • 2 You write a recipe for some transformation of those inputs;
  • 3 You compile it;
  • You have just got a function!
  • The gobbledygook version: you define a function as a symbolic computation graph.
  • There are two main kinds of entities: "Inputs" and "Transformations"
  • Both can be numbers, vectors, matrices, tensors, etc.
  • Both can be integers, floats or booleans (uint8) of various sizes.
  • An input is a placeholder for function parameters.
    • N from the example above
  • Transformations are recipes for computing something given inputs and other transformations
    • (T.arange(N)**2).sum() is a chain of 3 sequential transformations of N
    • Theano mirrors most of the numpy vector syntax
    • You can almost always just replace "np.function" with "T.function", aka "theano.tensor.function" (a minimal sketch follows this list)
      • np.mean -> T.mean
      • np.arange -> T.arange
      • np.cumsum -> T.cumsum
      • and so on.
      • builtin operations also work that way
      • np.arange(10).mean() -> T.arange(10).mean()
        • Once in a blue moon the functions have different names or locations (e.g. T.extra_ops)
        • Ask us or google it
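
To make the np -> T correspondence concrete, here is a minimal sketch (variable names are ours) that runs the same chain of operations through both libraries:

import numpy as np
import theano
import theano.tensor as T

v = T.vector("v", dtype='float64')
symbolic_result = T.cumsum(v).mean()          # same chain as np.cumsum(...).mean()
compute = theano.function([v], symbolic_result)

data = np.arange(10, dtype='float64')
print(compute(data))                          # theano version
print(np.cumsum(data).mean())                 # numpy version - same number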

Still confused? Let's fix that.


In [ ]:
# Inputs
example_input_scalar = T.scalar("scalar input", dtype='float32')

# dtype = theano.config.floatX by default
example_input_tensor = T.tensor4("four dimensional tensor input")
# don't worry, we won't need the tensor itself


input_vector = T.vector("my vector", dtype='int32')  # vector of integers

In [ ]:
# Transformations

# transformation: elementwise multiplication
double_the_vector = input_vector*2

# elementwise cosine
elementwise_cosine = T.cos(input_vector)

# difference between squared vector and vector itself
vector_squares = input_vector**2 - input_vector

In [ ]:
# Practice time:
# create two vectors of dtype float32
my_vector = student.init_float32_vector()
my_vector2 = student.init_one_more_such_vector()

In [ ]:
# Write a transformation (recipe):
# (vec1 * vec2) / (sin(vec1) + 1)
my_transformation = student.implementwhatwaswrittenabove()

In [ ]:
print(my_transformation)
# it's okay that it isn't a number

In [ ]:
# What's inside the transformation
theano.printing.debugprint(my_transformation)

Compiling

  • So far we have been using "symbolic" variables and transformations
    • Defining the recipe for computation, but not computing anything
  • To use the recipe, one should compile it (a minimal sketch follows this list)
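
Before filling in the cell below, here is a minimal compile-and-call sketch with two fresh vectors (the names are illustrative, not part of the assignment):

a = T.vector("a", dtype='float32')
b = T.vector("b", dtype='float32')

a_plus_b = a + b                              # symbolic recipe

add_vectors = theano.function(
    inputs=[a, b],
    outputs=a_plus_b,
    allow_input_downcast=True                 # e.g. lets float64 inputs be cast down to float32
)

print(add_vectors([1, 2, 3], [10, 20, 30]))   # -> [11. 22. 33.]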

In [ ]:
inputs = [ <YOUR CODE: two vectors that my_transformation depends on> ]
outputs = [ <YOUR CODE: what we compute (can be a list of several transformations)> ]

# The next lines compile a function that takes two vectors and computes your transformation
my_function = theano.function(
    inputs, outputs,
    # automatic type casting for input parameters (e.g. float64 -> float32)
    allow_input_downcast=True
)

In [ ]:
# calling the function with python lists:
print("using python lists:")
print(my_function([1, 2, 3], [4, 5, 6]))
print()

# Or using numpy arrays:
# btw, the 'float' (i.e. float64) array below gets downcast to the second input's dtype, float32
print("using numpy arrays:")
print(my_function(np.arange(10),
                  np.linspace(5, 6, 10, dtype='float')))

Debugging

  • Compilation can take a while for big functions
  • To avoid waiting, one can evaluate transformations without compiling
  • Without compilation, the code runs slower, so consider reducing input size

In [ ]:
# a dictionary of inputs
my_function_inputs = {
    my_vector: [1, 2, 3],
    my_vector2: [4, 5, 6]
}

# evaluate my_transformation
# the result has to match the compiled function's output
print(my_transformation.eval(my_function_inputs))


# can compute transformations on the fly
print("add 2 vectors", (my_vector + my_vector2).eval(my_function_inputs))

#!WARNING! if your transformation only depends on some inputs,
# do not provide the rest of them
print("vector's shape:", my_vector.shape.eval({
    my_vector: [1, 2, 3]
}))

  • When debugging, it's usually a good idea to reduce the scale of your computation. E.g. if you train on batches of 128 objects, debug on 2-3.
  • If it's imperative that you run on a large batch of data, consider compiling with mode='debug' instead

Your turn: Mean Squared Error (2 pts)


In [ ]:
# Quest #1 - implement a function that computes a mean squared error of two input vectors
# Your function has to take 2 vectors and return a single number

<YOUR CODE: student.define_inputs_and_transformations()>

compute_mse = <YOUR CODE: student.compile_function()>

In [ ]:
# Tests
from sklearn.metrics import mean_squared_error

for n in [1, 5, 10, 10**3]:

    elems = [np.arange(n), np.arange(n, 0, -1), np.zeros(n),
             np.ones(n), np.random.random(n), np.random.randint(100, size=n)]

    for el in elems:
        for el_2 in elems:
            true_mse = np.array(mean_squared_error(el, el_2))
            my_mse = compute_mse(el, el_2)
            if not np.allclose(true_mse, my_mse):
                print('Wrong result:')
                print('mse(%s,%s)' % (el, el_2))
                print("should be: %f, but your function returned %f" %
                      (true_mse, my_mse))
                raise ValueError("Something is wrong")

print("All tests passed")

Shared variables

  • The inputs and transformations only exist while the function is being called

  • Shared variables always stay in memory like global variables

    • Shared variables can be included into a symbolic graph (see the sketch after this list)
    • They can be set and evaluated using special methods
      • but they can't change value arbitrarily during symbolic graph computation
      • we'll cover that later;
  • Hint: such variables are a perfect place to store network parameters
    • e.g. weights or some metadata
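
As a minimal sketch of how a shared variable enters a compiled graph (the names here are illustrative):

shift = theano.shared(np.zeros(3, dtype='float64'))   # lives in memory between calls
offset = T.vector("offset", dtype='float64')

shifted = shift + offset                              # used like any other node in the graph
add_offset = theano.function([offset], shifted)       # note: shift is NOT in the inputs list

print(add_offset([1., 2., 3.]))                       # -> [1. 2. 3.]
shift.set_value(np.ones(3))                           # change the shared state...
print(add_offset([1., 2., 3.]))                       # -> [2. 3. 4.]  ...and the output follows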

In [ ]:
# creating shared variable
shared_vector_1 = theano.shared(np.ones(10, dtype='float64'))

In [ ]:
# evaluating a shared variable (outside the symbolic graph)
print("initial value", shared_vector_1.get_value())

# within a symbolic graph, you use them just like any other input or transformation; no get_value() needed

In [ ]:
# setting new value
shared_vector_1.set_value(np.arange(5))

# getting that new value
print("new value", shared_vector_1.get_value())

# Note that the vector changed shape
# This is entirely allowed... unless your graph is hard-wired to work with some fixed shape

Your turn


In [ ]:
# Write a recipe (transformation) that computes an elementwise transformation of shared_vector_1 and input_scalar
# Compile it as a function of input_scalar

input_scalar = T.scalar('coefficient', dtype='float32')

scalar_times_shared = <YOUR CODE: student.write_recipe()>


shared_times_n = <YOUR CODE: student.compile_function()>

In [ ]:
print "shared:", shared_vector_1.get_value()

print "shared_times_n(5)", shared_times_n(5)

print "shared_times_n(-0.5)", shared_times_n(-0.5)

In [ ]:
# Changing value of vector 1 (output should change)
shared_vector_1.set_value([-1, 0, 1])
print "shared:", shared_vector_1.get_value()

print "shared_times_n(5)", shared_times_n(5)

print "shared_times_n(-0.5)", shared_times_n(-0.5)

T.grad - why theano matters

  • Theano can compute derivatives and gradients automatically
  • Derivatives are computed symbolically, not numerically

Limitations:

  • You can only compute the gradient of a scalar transformation with respect to one or several scalar, vector (or tensor) transformations or inputs (a vector example is sketched after the plot below).
  • A transformation has to have float32 or float64 dtype throughout the whole computation graph
    • a derivative with respect to an integer makes no mathematical sense

In [ ]:
my_scalar = T.scalar(name='input', dtype='float64')

scalar_squared = T.sum(my_scalar**2)

# the derivative of scalar_squared with respect to my_scalar
derivative = T.grad(scalar_squared, my_scalar)

fun = theano.function([my_scalar], scalar_squared)
grad = theano.function([my_scalar], derivative)

In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline


x = np.linspace(-3, 3)
x_squared = list(map(fun, x))
x_squared_der = list(map(grad, x))

plt.plot(x, x_squared, label="x^2")
plt.plot(x, x_squared_der, label="derivative")
plt.legend()
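
T.grad works the same way when the scalar loss depends on a vector input; a short sketch (fresh names, for illustration only):

x = T.vector("x", dtype='float64')
vector_loss = T.sum(x ** 2)                   # scalar loss
vector_loss_grad = T.grad(vector_loss, x)     # d/dx sum(x^2) = 2x

loss_and_grad = theano.function([x], [vector_loss, vector_loss_grad])
print(loss_and_grad([1., 2., 3.]))            # -> [14.0, array([2., 4., 6.])]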

Why that rocks


In [ ]:
my_vector = T.vector("my_vector", dtype='float64')

# Compute the gradient of the next weird function over my_scalar and my_vector
# warning! Trying to understand the meaning of that function may result in permanent brain damage

weird_psychotic_function = ((my_vector+my_scalar)**(1+T.var(my_vector)) + 1./T.arcsinh(my_scalar)).mean()/(my_scalar**2 + 1) + 0.01*T.sin(2*my_scalar**1.5)*(
    T.sum(my_vector) * my_scalar**2)*T.exp((my_scalar-4)**2)/(1+T.exp((my_scalar-4)**2))*(1.-(T.exp(-(my_scalar-4)**2))/(1+T.exp(-(my_scalar-4)**2)))**2


der_by_scalar, der_by_vector = <YOUR CODE: student.compute_grad_over_scalar_and_vector()>


compute_weird_function = theano.function(
    [my_scalar, my_vector], weird_psychotic_function)
compute_der_by_scalar = theano.function([my_scalar, my_vector], der_by_scalar)

In [ ]:
# Plotting your derivative
vector_0 = [1, 2, 3]

scalar_space = np.linspace(0, 7)

y = [compute_weird_function(x, vector_0) for x in scalar_space]
plt.plot(scalar_space, y, label='function')
y_der_by_scalar = [compute_der_by_scalar(x, vector_0) for x in scalar_space]
plt.plot(scalar_space, y_der_by_scalar, label='derivative')
plt.grid()
plt.legend()

Almost done - Updates

  • updates are a way of changing shared variables after a function call.

  • technically it's a dictionary {shared_variable: a recipe for its new value} that has to be provided when the function is compiled

That's how it works:


In [ ]:
# Multiply shared vector by a number and save the product back into shared vector

inputs = [input_scalar]
outputs = [scalar_times_shared]  # return vector times scalar

my_updates = {
    # and write this same result back into shared_vector_1
    shared_vector_1: scalar_times_shared
}

compute_and_save = theano.function(inputs, outputs, updates=my_updates)

In [ ]:
shared_vector_1.set_value(np.arange(5))

# initial shared_vector_1
print("initial shared value:", shared_vector_1.get_value())

# evaluating the function (shared_vector_1 will be changed)
print("compute_and_save(2) returns", compute_and_save(2))

# evaluate new shared_vector_1
print("new shared value:", shared_vector_1.get_value())

Logistic regression example (4 pts)

Implement the regular logistic regression training algorithm

Tips:

  • The weights fit naturally into a shared variable
  • X and y are potential inputs
  • Compile 2 functions:
    • train_function(X, y) - returns the error and computes the weights' new values (through updates)
    • predict_function(X) - just computes probabilities ("y") given the data

We shall train on a two-class MNIST dataset

  • please note that target y are {0,1} and not {-1,1} as in some formulae

In [ ]:
from sklearn.datasets import load_digits
mnist = load_digits(n_class=2)

X, y = mnist.data, mnist.target


print("y [shape - %s]:" % (str(y.shape)), y[:10])

print("X [shape - %s]:" % (str(X.shape)))
print(X[:3])
print(y[:10])

In [ ]:
# inputs and shareds
shared_weights = <YOUR CODE: student.code_me()>
input_X = <YOUR CODE: student.code_me()>
input_y = <YOUR CODE: student.code_me()>

In [ ]:
predicted_y = <YOUR CODE: predicted probabilities for input_X>


loss = <YOUR CODE: logistic loss(scalar, mean over sample)>


grad = <YOUR CODE: gradient of loss over model weights>


updates = {
    shared_weights: <YOUR CODE: new weights after gradient step>
}

In [ ]:
train_function = <YOUR CODE: compile function that takes X and y, returns log loss and updates weights>

predict_function = <YOUR CODE: compile function that takes X and computes probabilities of y>

In [ ]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [ ]:
from sklearn.metrics import roc_auc_score

for i in range(5):
    loss_i = train_function(X_train, y_train)
    print("loss at iter %i:%.4f" % (i, loss_i))
    print("train auc:", roc_auc_score(y_train, predict_function(X_train)))
    print("test auc:", roc_auc_score(y_test, predict_function(X_test)))


print("resulting weights:")
plt.imshow(shared_weights.get_value().reshape(8, -1))
plt.colorbar()

lasagne

  • lasagne is a library for neural network building and training
  • it's a low-level library with almost seamless integration with theano

For a demo we shall solve the same digit recognition problem, but at a different scale

  • images are now 28x28
  • 10 different digits
  • 50k samples

In [ ]:
from mnist import load_dataset
X_train, y_train, X_val, y_val, X_test, y_test = load_dataset()

print(X_train.shape, y_train.shape)

In [ ]:
import lasagne

input_X = T.tensor4("X")

# input dimensions (None means "arbitrary" and only works for the first axis [samples])
input_shape = [None, 1, 28, 28]

target_y = T.vector("target Y integer", dtype='int32')

Defining network architecture


In [ ]:
# Input layer (auxiliary)
input_layer = lasagne.layers.InputLayer(shape=input_shape, input_var=input_X)

# fully connected layer that takes the input layer and applies 50 neurons to it.
# nonlinearity here is sigmoid as in logistic regression
# you can give a name to each layer (optional)
dense_1 = lasagne.layers.DenseLayer(input_layer, num_units=50,
                                    nonlinearity=lasagne.nonlinearities.sigmoid,
                                    name="hidden_dense_layer")

# fully connected output layer that takes dense_1 as input and has 10 neurons (1 for each digit)
# We use softmax nonlinearity to make probabilities add up to 1
dense_output = lasagne.layers.DenseLayer(dense_1, num_units=10,
                                         nonlinearity=lasagne.nonlinearities.softmax,
                                         name='output')

In [ ]:
# network prediction (theano-transformation)
y_predicted = lasagne.layers.get_output(dense_output)

In [ ]:
# all network weights (shared variables)
all_weights = lasagne.layers.get_all_params(dense_output)
print(all_weights)

Then you could simply

  • define loss function manually
  • compute error gradient over all weights
  • define updates
  • But that's a whole lot of work and life's short
    • not to mention life's too short to wait for SGD to converge

Instead, we shall use Lasagne builtins


In [ ]:
# Mean categorical crossentropy as a loss function - similar to logistic loss but for multiclass targets
loss = lasagne.objectives.categorical_crossentropy(
    y_predicted, target_y).mean()

# prediction accuracy
accuracy = lasagne.objectives.categorical_accuracy(
    y_predicted, target_y).mean()

# This function computes gradient AND composes weight updates just like you did earlier
updates_sgd = lasagne.updates.sgd(loss, all_weights, learning_rate=0.01)

In [ ]:
# function that computes loss and updates weights
train_fun = theano.function(
    [input_X, target_y], [loss, accuracy], updates=updates_sgd)

# function that just computes accuracy
accuracy_fun = theano.function([input_X, target_y], accuracy)

That's all, now let's train it!

  • We've got a lot of data, so it's recommended that you use SGD
  • So let's implement a function that splits the training sample into minibatches (a reference sketch is provided at the end of the cell below)

In [ ]:
# An auxiliary function that returns mini-batches for neural network training

# Parameters
# X - a tensor of images with shape (many, 1, 28, 28), e.g. X_train
# y - a vector of answers for the corresponding images, e.g. y_train
# batch_size - a single number - the intended size of each batch

# What you need to implement
# 1) Shuffle the data
# - You have to shuffle X and y the same way, so as not to break the correspondence between X_i and y_i
# 2) Split the data into minibatches of batch_size
# - If the data size is not a multiple of batch_size, make the last batch smaller.
# 3) Return a list (or an iterator) of pairs
# - (a batch of images, the answers from y for that batch)


def iterate_minibatches(X, y, batchsize):

    <YOUR CODE: return an iterable of (X_batch, y_batch) batches of images and answers for them>


#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
# You feel lost and wish you stayed home tonight?
# Go search for a similar function at
# https://github.com/Lasagne/Lasagne/blob/master/examples/mnist.py
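
# For reference, one possible minimal implementation (closely following the
# Lasagne MNIST example linked above); the _reference name is ours - feel free
# to write your own instead:

def iterate_minibatches_reference(X, y, batchsize, shuffle=True):
    """Yield (X_batch, y_batch) pairs; the last batch may be smaller than batchsize."""
    indices = np.arange(len(X))
    if shuffle:
        np.random.shuffle(indices)            # one permutation shared by X and y
    for start in range(0, len(X), batchsize):
        batch_idx = indices[start:start + batchsize]
        yield X[batch_idx], y[batch_idx]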

Training loop


In [ ]:
import time

num_epochs = 100  # number of passes over the data

batch_size = 50  # number of samples processed at each function call

for epoch in range(num_epochs):
    # In each epoch, we do a full pass over the training data:
    train_err = 0
    train_acc = 0
    train_batches = 0
    start_time = time.time()
    for batch in iterate_minibatches(X_train, y_train, batch_size):
        inputs, targets = batch
        train_err_batch, train_acc_batch = train_fun(inputs, targets)
        train_err += train_err_batch
        train_acc += train_acc_batch
        train_batches += 1

    # And a full pass over the validation data:
    val_acc = 0
    val_batches = 0
    for batch in iterate_minibatches(X_val, y_val, batch_size):
        inputs, targets = batch
        val_acc += accuracy_fun(inputs, targets)
        val_batches += 1

    # Then we print the results for this epoch:
    print("Epoch {} of {} took {:.3f}s".format(
        epoch + 1, num_epochs, time.time() - start_time))

    print(
        "  training loss (in-iteration):\t\t{:.6f}".format(train_err / train_batches))
    print("  train accuracy:\t\t{:.2f} %".format(
        train_acc / train_batches * 100))
    print("  validation accuracy:\t\t{:.2f} %".format(
        val_acc / val_batches * 100))

In [ ]:
test_acc = 0
test_batches = 0
for batch in iterate_minibatches(X_test, y_test, 500):
    inputs, targets = batch
    acc = accuracy_fun(inputs, targets)
    test_acc += acc
    test_batches += 1
print("Final results:")
print("  test accuracy:\t\t{:.2f} %".format(
    test_acc / test_batches * 100))

if test_acc / test_batches * 100 > 99:
    print("Achievement unlocked: 80lvl Warlock!")
else:
    print("We need more magic!")

A better network (4+ pts)

  • The quest is to create a network that gets at least 99% on the test set
    • In case you tried several architectures and have a detailed report - 97.5% "is fine too".
    • +1 bonus point for each 0.1% past 99%
    • More points for a creative approach

There is a mini-report at the end that you will have to fill in. We recommend reading it first and filling it in while you iterate.

Tips on what can be done:

  • Network size
    • MOAR neurons,
    • MOAR layers,
    • Convolutions are almost imperative
    • Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn!
  • Better optimization - rmsprop, nesterov_momentum, adadelta, adagrad and so on.
    • They converge faster and sometimes reach better optima
    • It might make sense to tweak the learning rate, other learning parameters, batch size and number of epochs
  • Dropout - to prevent overfitting
    • lasagne.layers.DropoutLayer(prev_layer, p=probability_to_zero_out)
  • Convolution layers (see the sketch after this list)

    • network = lasagne.layers.Conv2DLayer(prev_layer, num_filters=n_neurons, filter_size=(filter_width, filter_height), nonlinearity=some_nonlinearity)
    • Warning! Training convolutional networks can take a long time without a GPU.
      • If you are CPU-only, we still recommend trying a simple convolutional architecture
      • a perfect option is to set it up to run overnight and check it in the morning.
  • Plenty of other layers and architectures

  • Nonlinearities in the hidden layers
    • tanh, relu, leaky relu, etc
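
To make the layer APIs above concrete, here is a minimal (untuned) stacking sketch; treat the sizes as placeholders rather than a recommended architecture:

network = lasagne.layers.InputLayer(shape=(None, 1, 28, 28), input_var=input_X)

network = lasagne.layers.Conv2DLayer(network, num_filters=16, filter_size=(3, 3),
                                     nonlinearity=lasagne.nonlinearities.rectify)
network = lasagne.layers.MaxPool2DLayer(network, pool_size=(2, 2))

network = lasagne.layers.DenseLayer(network, num_units=64,
                                    nonlinearity=lasagne.nonlinearities.rectify)
network = lasagne.layers.DropoutLayer(network, p=0.5)   # zeroes out 50% of activations at train time

dense_output = lasagne.layers.DenseLayer(network, num_units=10,
                                         nonlinearity=lasagne.nonlinearities.softmax)

Note that if you use dropout (or batch normalization), build the evaluation graph with lasagne.layers.get_output(dense_output, deterministic=True) so that dropout is switched off at test time.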

There is a template for your solution below; you can use it, or throw it away and write it your own way


In [ ]:
from mnist import load_dataset
X_train, y_train, X_val, y_val, X_test, y_test = load_dataset()

print(X_train.shape, y_train.shape)

In [ ]:
import lasagne

input_X = T.tensor4("X")

# input dimensions (None means "arbitrary" and only works for the first axis [samples])
input_shape = [None, 1, 28, 28]

target_y = T.vector("target Y integer", dtype='int32')

In [ ]:
# Input layer (auxiliary)
input_layer = lasagne.layers.InputLayer(shape=input_shape, input_var=input_X)

<YOUR CODE: student.code_neural_network_architecture()>

dense_output = <YOUR CODE: output of your network>

In [ ]:
# Network predictions (theano-transformation)
y_predicted = lasagne.layers.get_output(dense_output)

In [ ]:
# All weights (shared variables)
# the "trainable" flag means not to return auxiliary params like the batch mean (for batch normalization)
all_weights = lasagne.layers.get_all_params(dense_output, trainable=True)
print(all_weights)

In [ ]:
# loss function
loss = <YOUR CODE: loss function>

# <optionally add regularization>

accuracy = <YOUR CODE: mean accuracy score for evaluation>

# weight updates
updates = <YOUR CODE: try different update methods>

In [ ]:
# A function that accepts X and y, returns the loss and accuracy, and performs weight updates
train_fun = theano.function(
    [input_X, target_y], [loss, accuracy], updates=updates)

# A function that just computes accuracy given X and y
accuracy_fun = theano.function([input_X, target_y], accuracy)

In [ ]:
# training iterations

num_epochs = <YOUR CODE: how many times to iterate over the entire training set>

batch_size = <YOUR CODE: how many samples are processed at a single function call>

for epoch in range(num_epochs):
    # In each epoch, we do a full pass over the training data:
    train_err = 0
    train_acc = 0
    train_batches = 0
    start_time = time.time()
    for batch in iterate_minibatches(X_train, y_train, batch_size):
        inputs, targets = batch
        train_err_batch, train_acc_batch = train_fun(inputs, targets)
        train_err += train_err_batch
        train_acc += train_acc_batch
        train_batches += 1

    # And a full pass over the validation data:
    val_acc = 0
    val_batches = 0
    for batch in iterate_minibatches(X_val, y_val, batch_size):
        inputs, targets = batch
        val_acc += accuracy_fun(inputs, targets)
        val_batches += 1

    # Then we print the results for this epoch:
    print("Epoch {} of {} took {:.3f}s".format(
        epoch + 1, num_epochs, time.time() - start_time))

    print(
        "  training loss (in-iteration):\t\t{:.6f}".format(train_err / train_batches))
    print("  train accuracy:\t\t{:.2f} %".format(
        train_acc / train_batches * 100))
    print("  validation accuracy:\t\t{:.2f} %".format(
        val_acc / val_batches * 100))

In [ ]:
test_acc = 0
test_batches = 0
for batch in iterate_minibatches(X_test, y_test, 500):
    inputs, targets = batch
    acc = accuracy_fun(inputs, targets)
    test_acc += acc
    test_batches += 1
print("Final results:")
print("  test accuracy:\t\t{:.2f} %".format(
    test_acc / test_batches * 100))

if test_acc / test_batches * 100 > 99:
    print("Achievement unlocked: 80lvl Warlock!")
else:
    print("We need more magic!")

Report

All creative approaches are highly welcome, but at the very least it would be great to mention

  • the idea;
  • brief history of tweaks and improvements;
  • what is the final architecture and why?
  • what is the training method and, again, why?
  • Any regularizations and other techniques applied and their effects;

There is no need to write strict mathematical proofs (unless you want to).

  • "I tried this, this and this, and the second one turned out to be better. And i just didn't like the name of that one" - OK, but can be better
  • "I have analized these and these articles|sources|blog posts, tried that and that to adapt them to my problem and the conclusions are such and such" - the ideal one
  • "I took that code that demo without understanding it, but i'll never confess that and instead i'll make up some pseudoscientific explaination" - not_ok

Hi, my name is ___ ___, and here's my story

A long time ago, in a galaxy far, far away, when it was still more than an hour before the deadline, I got an idea:

I'm gonna build a neural network that
  • brief text on what was
  • the original idea
  • and why it was so

How could I be so naive?!

One day, with no signs of warning,

This thing has finally converged and

  • Some explanation of what the results were,
  • what worked and what didn't
  • most importantly - what next steps were taken, if any
  • and what were their respective outcomes

Finally, after __ iterations and __ mugs of [tea/coffee]
  • what was the final architecture
  • as well as training method and tricks

That, having wasted __ [minutes, hours or days] of my life training, got

  • accuracy on training: __
  • accuracy on validation: __
  • accuracy on test: __

[an optional afterword and mortal curses on assignment authors]


In [ ]: